{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Azure AI Document Intelligence"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">[Azure AI Document Intelligence](https://aka.ms/doc-intelligence) (formerly known as `Azure Form Recognizer`) is machine-learning \n",
    ">based service that extracts texts (including handwriting), tables, document structures (e.g., titles, section headings, etc.) and key-value-pairs from\n",
    ">digital or scanned PDFs, images, Office and HTML files.\n",
    ">\n",
    ">Document Intelligence supports `PDF`, `JPEG/JPG`, `PNG`, `BMP`, `TIFF`, `HEIF`, `DOCX`, `XLSX`, `PPTX` and `HTML`.\n",
    "\n",
    "This current implementation of a loader using `Document Intelligence` can incorporate content page-wise and turn it into LangChain documents. The default output format is markdown, which can be easily chained with `MarkdownHeaderTextSplitter` for semantic document chunking. You can also use `mode=\"single\"` or `mode=\"page\"` to return pure texts in a single page or document split by page.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisite\n",
    "\n",
    "An Azure AI Document Intelligence resource in one of the 3 preview regions: **East US**, **West US2**, **West Europe** - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have. You will be passing `<endpoint>` and `<key>` as parameters to the loader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install --upgrade --quiet  langchain langchain-community azure-ai-documentintelligence"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 1"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first example uses a local file which will be sent to Azure AI Document Intelligence."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the initialized document analysis client, we can proceed to create an instance of the DocumentIntelligenceLoader:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
    "\n",
    "file_path = \"<filepath>\"\n",
    "endpoint = \"<endpoint>\"\n",
    "key = \"<key>\"\n",
    "loader = AzureAIDocumentIntelligenceLoader(\n",
    "    api_endpoint=endpoint, api_key=key, file_path=file_path, api_model=\"prebuilt-layout\"\n",
    ")\n",
    "\n",
    "documents = loader.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default output contains one LangChain document with markdown format content: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 2\n",
    "The input file can also be a public URL path. E.g., https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/layout.png."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url_path = \"<url>\"\n",
    "loader = AzureAIDocumentIntelligenceLoader(\n",
    "    api_endpoint=endpoint, api_key=key, url_path=url_path, api_model=\"prebuilt-layout\"\n",
    ")\n",
    "\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 3\n",
    "You can also specify `mode=\"page\"` to load document by pages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
    "\n",
    "file_path = \"<filepath>\"\n",
    "endpoint = \"<endpoint>\"\n",
    "key = \"<key>\"\n",
    "loader = AzureAIDocumentIntelligenceLoader(\n",
    "    api_endpoint=endpoint,\n",
    "    api_key=key,\n",
    "    file_path=file_path,\n",
    "    api_model=\"prebuilt-layout\",\n",
    "    mode=\"page\",\n",
    ")\n",
    "\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output will be each page stored as a separate document in the list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for document in documents:\n",
    "    print(f\"Page Content: {document.page_content}\")\n",
    "    print(f\"Metadata: {document.metadata}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example 4\n",
    "You can also specify `analysis_feature=[\"ocrHighResolution\"]` to enable add-on capabilities. For more information, see: https://aka.ms/azsdk/python/documentintelligence/analysisfeature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader\n",
    "\n",
    "file_path = \"<filepath>\"\n",
    "endpoint = \"<endpoint>\"\n",
    "key = \"<key>\"\n",
    "analysis_features = [\"ocrHighResolution\"]\n",
    "loader = AzureAIDocumentIntelligenceLoader(\n",
    "    api_endpoint=endpoint,\n",
    "    api_key=key,\n",
    "    file_path=file_path,\n",
    "    api_model=\"prebuilt-layout\",\n",
    "    analysis_features=analysis_features,\n",
    ")\n",
    "\n",
    "documents = loader.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The output contains the LangChain document recognized with high resolution add-on capability:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "documents"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  },
  "vscode": {
   "interpreter": {
    "hash": "f9f85f796d01129d0dd105a088854619f454435301f6ffec2fea96ecbd9be4ac"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
